Fun facts about Vancouver tree distributions

Foreword

This notebook presents an exploratory data analysis of a subset of the Vancouver Street Trees dataset located here.

Introduction

Questions of interest

Let's import the subset of the Vancouver Street Trees data. Since this is a new dataset, a good first step is to get familiar with it by glancing at the values in the dataframe.
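As a minimal sketch of this first step, the snippet below reads a tiny made-up CSV with pandas and prints the first rows. The column names are ones mentioned in this notebook; the values and the in-memory buffer are assumptions standing in for the real file.

```python
import io
import pandas as pd

# Made-up stand-in for the real CSV file; only the column names come from
# this notebook, the values are invented for illustration.
csv_text = """neighbourhood_name,street_side_name,diameter,height_range_id,date_planted
KITSILANO,EVEN,12.0,3,1996-03-12
RENFREW-COLLINGWOOD,ODD,3.5,1,
SUNSET,MED,8.0,2,2002-11-05
"""

# For the real data, pass the file path or URL instead of the StringIO buffer.
trees = pd.read_csv(io.StringIO(csv_text))
print(trees.head())
```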

Next, let's check the type of data in each column and how many missing values there are.

From the above information, the dtype of date_planted is object, so we need to parse this column as dates rather than strings. We can pass parse_dates=['date_planted'] to read_csv when reading the file again.

Also, it looks like there are NaNs in three of the columns. date_planted and cultivar_name seem to have the most: about half of their rows are missing a value.

Now we parse the dates and reprint the info of the dataset.
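The effect of parse_dates can be sketched on a small made-up CSV (the values are assumptions, not rows from the dataset). Reading the same text with and without the argument shows how the dtype of date_planted changes, and that empty fields become missing dates.

```python
import io
import pandas as pd

# Invented sample: the second row has no planting date.
csv_text = """diameter,date_planted
12.0,1996-03-12
3.5,
8.0,2002-11-05
"""

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date_planted"])

print(raw["date_planted"].dtype)     # a string-like dtype, not a datetime
print(parsed["date_planted"].dtype)  # a datetime64 dtype
parsed.info()                        # the missing date shows up as one fewer non-null value
```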

Visualizing missing values helps us identify potential issues with the data.

By visualizing the missing values for each column next to each other, we can quickly see whether the columns share similar patterns. From the above plot we find that the missing values in cultivar_name and date_planted do not fall in exactly the same rows, although both columns are missing a value in about half of their rows. The column plant_area is missing a value in only 1% of rows.
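The same comparison can be made numerically. The sketch below, on invented rows, computes the fraction of missing values per column and counts the rows where cultivar_name and date_planted disagree on missingness; a count of zero would mean the two patterns are identical.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names come from this notebook.
trees = pd.DataFrame({
    "cultivar_name": ["BOWHALL", np.nan, np.nan, "CRIMSON KING"],
    "date_planted": [np.nan, "1996-03-12", np.nan, "2002-11-05"],
    "plant_area": ["8", "10", "B", "N"],
})

missing = trees.isna()
missing_share = missing.mean()  # fraction of missing values per column
print(missing_share)

# Rows where one of the two columns is missing and the other is not.
differ = (missing["cultivar_name"] != missing["date_planted"]).sum()
print(differ)
```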

Since cultivar_name and plant_area are categorical columns that describe the trees, we do not need to drop their NaN values when these columns are not the focus of the analysis. For the column date_planted, we can drop the NaN values when we compute time-related statistics. Because almost half of the rows are missing a value in date_planted, we keep those rows rather than drop them when we compute statistics unrelated to time.
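One way to follow this strategy, sketched on invented rows, is to keep the full dataframe for statistics unrelated to time and to derive a second dataframe with dropna(subset=...) for the time-related ones. The name dated is hypothetical.

```python
import pandas as pd

# Invented rows: two of the four trees have no planting date.
trees = pd.DataFrame({
    "diameter": [12.0, 3.5, 8.0, 5.0],
    "date_planted": pd.to_datetime(["1996-03-12", None, "2002-11-05", None]),
})

# Time-unrelated statistic: use every row, dated or not.
mean_all = trees["diameter"].mean()

# Time-related statistic: drop only the rows without a planting date.
dated = trees.dropna(subset=["date_planted"])
trees_per_year = dated["date_planted"].dt.year.value_counts()

print(mean_all)
print(trees_per_year)
```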

Now let’s print out the summary statistics for the numerical columns.

Visualizing the distributions of all numerical columns helps us understand the data.

The first column, unnamed:0, appears to be the row id from the original dataset. It is of little interest when exploring relationships between numerical columns, so we ignore it in the numerical exploration that follows.
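A minimal sketch of excluding the id column before summarizing is shown below. It assumes the column is labelled "Unnamed: 0", which is the name pandas assigns to an unlabeled index column when a CSV is read back in; the rows are invented.

```python
import pandas as pd

# Invented rows; "Unnamed: 0" is assumed to be the label of the id column.
trees = pd.DataFrame({
    "Unnamed: 0": [10747, 14583, 2391],
    "diameter": [12.0, 3.5, 8.0],
    "height_range_id": [3, 1, 2],
    "neighbourhood_name": ["KITSILANO", "SUNSET", "SUNSET"],
})

# Keep numerical columns only, then drop the id column.
numeric = trees.select_dtypes(include="number").drop(columns=["Unnamed: 0"])
summary = numeric.describe()
print(summary)
```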

This overview tells us that most trees have a diameter of less than 5 inches and a height range id between 1 and 2. As trees get bigger and taller, the counts go down. Also, the civic number and street block number seem to share the same distribution.

Repeating columns of both X and Y lets us effectively explore pairwise relationships between columns.

Unfortunately, these plots are saturated: so many points overlap that dense regions cannot be told apart. Although we can see that there might be some correlations, we should remake this plot as a 2D histogram heatmap.
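The binning behind a 2D histogram heatmap can be sketched with NumPy. The data here are synthetic and the 10 by 10 grid is an arbitrary assumption; the point is only that each cell stores a count, so overlapping points add up instead of hiding each other.

```python
import numpy as np

# Synthetic stand-ins for two numerical columns (not the real data).
rng = np.random.default_rng(0)
diameter = rng.gamma(shape=2.0, scale=4.0, size=1000)
height_range_id = np.clip(np.round(diameter / 6 + rng.normal(0, 0.7, 1000)), 0, 9)

# Count the points falling in each cell of a 10 x 10 grid.
counts, x_edges, y_edges = np.histogram2d(diameter, height_range_id, bins=(10, 10))

print(counts.shape)
print(counts.max())  # the densest cell, which a heatmap would colour most strongly
```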

From the above heatmaps, we find that diameter and height might have a positive relationship when the diameter is less than 25 inches. We also learn that civic number and block number are related to longitude and latitude, which hints at some interesting aspects of the geographic distribution.

Visualizing the counts of all categorical columns also helps us understand the data. Because some columns have too many distinct values, we select only a subset of the categorical columns to explore here.

Some of these distributions are interesting, such as how trees were planted across different street sides and neighbourhoods. We explore more fun aspects of the data in the following exploratory visualizations.

Exploratory Visualizations

Question 1: How are trees distributed over the years across different neighbourhoods in Vancouver?

From the above plot, we can easily see that most trees were planted in 1996, 2002 and 2013. Next, we look at the trees planted in different neighbourhoods over these years.

From the above heatmap, we learn that most trees were planted in Hastings-Sunrise, Kensington-Cedar Cottage, Renfrew-Collingwood, Sunset and Victoria-Fraserview from 1992 to 2002.
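The counts behind such a heatmap can be sketched as a neighbourhood by year table. The rows below are invented, and pd.crosstab is one possible way to build the table, not necessarily the one used for the plot.

```python
import pandas as pd

# Invented rows; only the column names come from this notebook.
trees = pd.DataFrame({
    "neighbourhood_name": ["SUNSET", "SUNSET", "KITSILANO", "SUNSET", "KITSILANO"],
    "date_planted": pd.to_datetime(
        ["1996-03-12", "1996-10-01", "1996-04-20", "2002-11-05", "2013-02-14"]
    ),
})

year = trees["date_planted"].dt.year.rename("year")

# Trees planted per year, and the year with the most plantings.
per_year = year.value_counts()
busiest_year = per_year.idxmax()

# One row per neighbourhood, one column per year, cells hold tree counts.
counts = pd.crosstab(trees["neighbourhood_name"], year)
print(counts)
```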

We find that the neighbourhoods that planted the most trees from 1992 to 2002 are also the areas with the most trees today.

As a bonus to question 1, we would also like to look at the distribution of tree heights over the years.

We find that trees planted in 1991 are the tallest today, so they either grew fastest or were tallest when planted, which is really interesting. We might learn more about this in later exploration.

Question 2: How do trees differ among street sides in Vancouver?

To answer this question, we'll explore the relationship between average tree size and the street sides.

We find that trees planted on either side of the street are bigger and taller than those planted in the middle of the street. Trees in the bike area are usually the smallest. This makes sense, because it matches the impression we get when looking at trees on the street.
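The averages behind this comparison can be sketched with a groupby on street_side_name. The rows and the category labels EVEN, ODD and MED are illustrative assumptions.

```python
import pandas as pd

# Invented rows; the street side labels are illustrative.
trees = pd.DataFrame({
    "street_side_name": ["EVEN", "ODD", "MED", "EVEN", "MED", "ODD"],
    "diameter": [10.0, 12.0, 4.0, 14.0, 6.0, 8.0],
    "height_range_id": [3, 4, 1, 3, 2, 2],
})

# Mean diameter and mean height range per street side, largest diameter first.
avg = (
    trees.groupby("street_side_name")[["diameter", "height_range_id"]]
    .mean()
    .sort_values("diameter", ascending=False)
)
print(avg)
```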

Question 3: Which neighbourhoods have the most big and tall trees in Vancouver?

Now we explore the most wonderful neighbourhoods, where giant, tall trees are most abundant.

From the above plots we find that Kitsilano, Dunbar, Fairview, Shaughnessy and Kerrisdale are the neighbourhoods with the most big and tall trees. It is fascinating that these neighbourhoods are all in the Vancouver West area and usually have the highest housing prices as well.
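One possible way to rank neighbourhoods is sketched below: filter to trees above a size threshold, then count them per neighbourhood. The thresholds (diameter of at least 20 and height_range_id of at least 5) and the rows are assumptions for illustration; the notebook does not state how "big and tall" was defined.

```python
import pandas as pd

# Invented rows; only the column names come from this notebook.
trees = pd.DataFrame({
    "neighbourhood_name": [
        "KITSILANO", "KITSILANO", "SUNSET", "DUNBAR-SOUTHLANDS", "SUNSET", "KITSILANO",
    ],
    "diameter": [30.0, 25.0, 4.0, 28.0, 22.0, 3.0],
    "height_range_id": [6, 5, 1, 6, 2, 1],
})

# Assumed thresholds for "big" and "tall"; adjust them to the real distributions.
big_and_tall = trees[(trees["diameter"] >= 20) & (trees["height_range_id"] >= 5)]

# Neighbourhoods ranked by their number of big and tall trees.
top = big_and_tall["neighbourhood_name"].value_counts().head(2)
print(top)
```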

Now let's use subplots to look at how tree sizes are distributed in these top neighbourhoods.

These subplots show that Fairview has the most evenly distributed tree sizes, just as its name "Fairview" suggests! What a fun fact!

Using both the colour and marker size to indicate the count creates an effective visualization in the above plot. We can easily see that a diameter of less than 5 combined with a height range between 1 and 1.5 is the most common tree size in Vancouver. Trees with a diameter between 5 and 10 and a height range between 2 and 2.5 come second.
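The counts that drive colour and marker size in such a plot can be sketched by binning diameter and counting each combination with the height range. The rows and the bin edges of 0, 5, 10 and 15 are assumptions.

```python
import pandas as pd

# Invented rows; only the column names come from this notebook.
trees = pd.DataFrame({
    "diameter": [3.0, 4.0, 2.5, 7.0, 8.0, 12.0],
    "height_range_id": [1, 1, 1, 2, 2, 3],
})

# Left-closed bins: [0, 5), [5, 10), [10, 15).
trees["diameter_bin"] = pd.cut(trees["diameter"], bins=[0, 5, 10, 15], right=False)

# One row per (diameter bin, height range) pair with its tree count.
counts = (
    trees.groupby(["diameter_bin", "height_range_id"], observed=True)
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)
print(counts)
```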

Conclusion

Building on the above exploratory visualizations, we will keep exploring and focus on fun facts about tree distributions in the report. Some of these were inspired by the quick and dirty EDA plots in the introduction. The columns of interest are date_planted, neighbourhood_name, diameter, height_range_id and street_side_name.

During the exploration of the data, we found some interesting aspects that match people's impressions of the city of Vancouver, such as prestigious communities with more giant trees versus newly developing communities with more recently planted trees. We also found other fun facts, such as a tree distribution that fits its neighbourhood's name perfectly, as in "Fairview".

The five key types of graphs are as follows:

From a heatmap plot, we learn that most trees were planted in Hastings-Sunrise, Kensington-Cedar Cottage and other neighbourhoods in the east of Vancouver from 1992 to 2002.

Through simple bar plots, we can see how tree sizes contrast among different street sides in Vancouver.

First we use simple bar plots to find the top neighbourhoods with the most giant and tall trees. Coincidentally, they are all located in the west of Vancouver. Then we use histogram subplots faceted by top neighbourhood and find another fun fact about the tree distribution.

Using both the colour and marker size to indicate the count creates an effective visualization in the circle plot. It makes it easy to identify the most common range of tree sizes in Vancouver.

Exploring the first question opened another door to something more interesting. Using a line and point plot, we can easily see that trees planted in 1991 either grew fastest or were tallest when planted, because they are the tallest trees today.